 Lake County


A Targeted Learning Framework for Estimating Restricted Mean Survival Time Difference using Pseudo-observations

Jin, Man, Fang, Yixin

arXiv.org Machine Learning

A targeted learning (TL) framework is developed to estimate the difference in the restricted mean survival time (RMST) for a clinical trial with time-to-event outcomes. The approach starts by defining the target estimand as the RMST difference between investigational and control treatments. Next, an efficient estimation method is introduced: a targeted minimum loss estimator (TMLE) utilizing pseudo-observations. Moreover, a version of the copy reference (CR) approach is developed to perform a sensitivity analysis for right-censoring. The proposed TL framework is demonstrated using a real data application.
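As a rough illustration of the quantities named in this abstract (not the authors' TMLE), the RMST up to a horizon tau is the area under the survival curve on [0, tau], and pseudo-observations are commonly built with a leave-one-out jackknife. The sketch below assumes a Kaplan-Meier estimate of the survival curve. The function names `km_rmst` and `rmst_pseudo_observations` are hypothetical and do not come from the paper.

```python
import numpy as np

def km_rmst(time, event, tau):
    """Area under the Kaplan-Meier survival curve on [0, tau]."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    # Distinct observed event times no later than the horizon tau.
    event_times = np.unique(time[event & (time <= tau)])
    surv, prev_t, area = 1.0, 0.0, 0.0
    for t in event_times:
        # The curve is flat between event times, so add a rectangle.
        area += surv * (t - prev_t)
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & event)
        surv *= 1.0 - deaths / at_risk
        prev_t = t
    # Last rectangle from the final event time up to tau.
    area += surv * (tau - prev_t)
    return area

def rmst_pseudo_observations(time, event, tau):
    """Jackknife pseudo-observations: n * theta_hat - (n - 1) * theta_hat_without_i."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    n = len(time)
    full = km_rmst(time, event, tau)
    keep = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):
        keep[i] = False  # drop subject i
        pseudo[i] = n * full - (n - 1) * km_rmst(time[keep], event[keep], tau)
        keep[i] = True   # restore subject i
    return pseudo
```

With no censoring, each pseudo-observation reduces to min(T_i, tau), which is a convenient sanity check for a sketch like this one.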


STREETS: A Novel Camera Network Dataset for Traffic Flow

Corey Snyder, Minh Do

Neural Information Processing Systems

In this paper, we introduce STREETS, a novel traffic flow dataset from publicly available web cameras in the suburbs of Chicago, IL. We seek to address the limitations of existing datasets in this area. Many such datasets lack a coherent traffic network graph to describe the relationship between sensors.


Estimand framework and intercurrent events handling for clinical trials with time-to-event outcomes

Fang, Yixin, Jin, Man

arXiv.org Machine Learning

The ICH E9(R1) guideline presents a framework for clinical trials to align planning, design, conduct, analysis, and interpretation (ICH, 2020). The three key steps in the framework are: estimand, estimator, and sensitivity analysis (Mallinckrodt et al., 2020). ICH E9(R1) highlights the importance of dealing with intercurrent events (ICEs), which are defined as: "Events occurring after treatment initiation that affect either the interpretation or the existence of the measurements associated with the clinical question of interest. It is necessary to address intercurrent events when describing the clinical question of interest in order to precisely define the treatment effect that is to be estimated." ICH E9(R1) proposes five strategies for dealing with ICEs in clinical trials with quantitative outcomes and categorical outcomes: treatment policy strategy, hypothetical strategy, composite variable strategy, while-on-treatment strategy, and principal stratum strategy.


Solar Irradiation Forecasting using Genetic Algorithms

Gunasekaran, V., Kovi, K. K., Arja, S., Chimata, R.

arXiv.org Artificial Intelligence

Renewable energy forecasting is attaining greater importance due to its constant increase in contribution to the electrical power grids. Solar energy is one of the most significant contributors to renewable energy and is dependent on solar irradiation. For the effective management of electrical power grids, forecasting models that predict solar irradiation with high accuracy are needed. In the current study, Machine Learning techniques such as Linear Regression, Extreme Gradient Boosting (XGB) and Genetic Algorithm Optimization are used to forecast solar irradiation. The data used for training and validation were recorded at three geographically distinct stations in the United States that are part of the SURFRAD network. Global Horizontal Irradiance (GHI) is predicted by each of the models built, and the predictions are compared. Genetic Algorithm Optimization is applied to XGB to further improve the accuracy of solar irradiation prediction.


Beyond Static Retrieval: Opportunities and Pitfalls of Iterative Retrieval in GraphRAG

Guo, Kai, Dai, Xinnan, Zeng, Shenglai, Shomer, Harry, Han, Haoyu, Wang, Yu, Tang, Jiliang

arXiv.org Artificial Intelligence

Retrieval-augmented generation (RAG) is a powerful paradigm for improving large language models (LLMs) on knowledge-intensive question answering. Graph-based RAG (GraphRAG) leverages entity-relation graphs to support multi-hop reasoning, but most systems still rely on static retrieval. When crucial evidence, especially bridge documents that connect disjoint entities, is absent, reasoning collapses and hallucinations persist. Iterative retrieval, which performs multiple rounds of evidence selection, has emerged as a promising alternative, yet its role within GraphRAG remains poorly understood. We present the first systematic study of iterative retrieval in GraphRAG, analyzing how different strategies interact with graph-based backbones and under what conditions they succeed or fail. Our findings reveal clear opportunities: iteration improves performance on complex multi-hop questions, helps promote bridge documents into leading ranks, and different strategies offer complementary strengths. At the same time, pitfalls remain: naive expansion often introduces noise that reduces precision, gains are limited on single-hop or simple comparison questions, and some bridge evidence remains buried too deep to be used effectively. Together, these results highlight a central bottleneck, namely that GraphRAG's effectiveness depends not only on recall but also on whether bridge evidence is consistently promoted into leading positions where it can support reasoning chains. To address this challenge, we propose Bridge-Guided Dual-Thought-based Retrieval (BDTR), a simple yet effective framework that generates complementary thoughts and leverages reasoning chains to recalibrate rankings and bring bridge evidence into leading positions. BDTR achieves consistent improvements across diverse GraphRAG settings and provides guidance for the design of future GraphRAG systems.


Body-terrain interaction affects large bump traversal of insects and legged robots

Gart, Sean W., Li, Chen

arXiv.org Artificial Intelligence

Small animals and robots must often rapidly traverse large bump-like obstacles when moving through complex 3-D terrains, during which, in addition to leg-ground contact, their body inevitably comes into physical contact with the obstacles. However, we know little about the performance limits of large bump traversal and how body-terrain interaction affects traversal. To address these, we challenged the discoid cockroach and an open-loop six-legged robot to dynamically run into a large bump of varying height to discover the maximal traversal performance, and studied how locomotor modes and traversal performance are affected by body-terrain interaction. Remarkably, during rapid running, both the animal and the robot were capable of dynamically traversing a bump much higher than its hip height (up to 4 times the hip height for the animal and 3 times for the robot, respectively) at traversal speeds typical of running, with decreasing traversal probability with increasing bump height. A stability analysis using a novel locomotion energy landscape model explained why traversal was more likely when the animal or robot approached the bump with a low initial body yaw and a high initial body pitch, and why deflection was more likely otherwise. Inspired by these principles, we demonstrated a novel control strategy of active body pitching that increased the robot's maximal traversable bump height by 75%. Our study is a major step in establishing the framework of locomotion energy landscapes to understand locomotion in complex 3-D terrains. Bioinspiration & Biomimetics (2018), 13, 026005; https://li.me.jhu.edu


Generalised Label-free Artefact Cleaning for Real-time Medical Pulsatile Time Series

Chen, Xuhang, Olakorede, Ihsane, Bögli, Stefan Yu, Xu, Wenhao, Beqiri, Erta, Li, Xuemeng, Tang, Chenyu, Gao, Zeyu, Gao, Shuo, Ercole, Ari, Smielewski, Peter

arXiv.org Artificial Intelligence

Artefacts compromise clinical decision-making in the use of medical time series. Pulsatile waveforms offer opportunities for accurate artefact detection, yet most approaches rely on supervised learning and overlook patient-level distribution shifts. To address these issues, we introduce a generalised label-free framework, GenClean, for real-time artefact cleaning and leverage an in-house dataset of 180,000 ten-second arterial blood pressure (ABP) samples for training. We first investigate patient-level generalisation, demonstrating robust performance under both intra- and inter-patient distribution shifts. We further validate its effectiveness through challenging cross-disease cohort experiments on the MIMIC-III database. Additionally, we extend our method to photoplethysmography (PPG), highlighting its applicability to diverse medical pulsatile signals. Finally, its integration into ICM+, a clinical research monitoring software, confirms the real-time feasibility of our framework, emphasising its practical utility in continuous physiological monitoring. This work provides a foundational step toward precision medicine in improving the reliability of high-resolution medical time series analysis.


Achieving Operational Universality through a Turing Complete Chemputer

Gahler, Daniel, Thomas, Dean, Lach, Slawomir, Cronin, Leroy

arXiv.org Artificial Intelligence

The most fundamental abstraction underlying all modern computers is the Turing Machine: if a modern computer can simulate a Turing Machine, an equivalence called Turing completeness, it is theoretically possible to achieve any task that can be algorithmically described by executing a series of discrete unit operations. In chemistry, the ability to program chemical processes is demanding because it is hard to ensure that the process can be understood at a high level of abstraction, and then reduced to practice. Herein we exploit the concept of Turing completeness applied to robotic platforms for chemistry that can be used to synthesise complex molecules through unit operations that execute chemical processes using a chemically-aware programming language, XDL. We extend the concept of computability by computers to the synthesizability of chemical compounds by automated synthesis machines. The results of an interactive demonstration of Turing completeness using the colour gamut and conditional logic are presented and examples of chemical use-cases are discussed. Over 16.7 million combinations of Red, Green, Blue (RGB) colour space were binned into 5 discrete values and measured over 10 regions of interest (ROIs), affording 78 million possible states per step, which served as a proxy for conceptual, chemical space exploration. This formal description establishes a framework for future chemical programming languages to ensure complex logic operations are expressed and executed correctly, with the possibility of error correction, in the automated and autonomous pursuit of increasingly complex molecules.


Optimal Survey Design for Private Mean Estimation

Chen, Yu-Wei, Pasupathy, Raghu, Awan, Jordan A.

arXiv.org Machine Learning

This work identifies the first privacy-aware stratified sampling scheme that minimizes the variance for general private mean estimation under the Laplace, Discrete Laplace (DLap) and Truncated-Uniform-Laplace (TuLap) mechanisms within the framework of differential privacy (DP). We view stratified sampling as a subsampling operation, which amplifies the privacy guarantee; however, to have the same final privacy guarantee for each group, different nominal privacy budgets need to be used depending on the subsampling rate. Ignoring the effect of DP, traditional stratified sampling strategies risk significant variance inflation. We phrase our optimal survey design as an optimization problem, where we determine the optimal subsampling sizes for each group with the goal of minimizing the variance of the resulting estimator. We establish strong convexity of the variance objective, propose an efficient algorithm to identify the integer-optimal design, and offer insights on the structure of the optimal design.
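To make the budget adjustment described above concrete, the sketch below uses the standard privacy-amplification-by-subsampling bound: an epsilon-DP mechanism run on a random subsample drawn at rate q satisfies log(1 + q(e^epsilon - 1))-DP. Whether the paper's accounting matches this bound exactly is an assumption, and the function names are hypothetical. The last helper gives the variance of Laplace noise with scale sensitivity/epsilon, which is 2(sensitivity/epsilon)^2.

```python
import math

def amplified_epsilon(nominal_eps, rate):
    """Effective epsilon after subsampling at the given rate (0 < rate <= 1)."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must lie in (0, 1]")
    return math.log1p(rate * math.expm1(nominal_eps))

def nominal_epsilon(target_eps, rate):
    """Nominal epsilon whose amplified value equals target_eps (inverse of the bound)."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must lie in (0, 1]")
    return math.log1p(math.expm1(target_eps) / rate)

def laplace_noise_variance(sensitivity, eps):
    """Variance of Laplace noise with scale b = sensitivity / eps, i.e. 2 * b**2."""
    return 2.0 * (sensitivity / eps) ** 2

if __name__ == "__main__":
    # Illustrative strata only: lower sampling rates allow a larger nominal budget.
    for rate in (0.1, 0.5, 1.0):
        print(rate, nominal_epsilon(1.0, rate))
```

Under this bound, a stratum sampled at a lower rate can use a larger nominal budget, and hence less noise, for the same final guarantee. This is one way to see why the sampling rates and the noise variance have to be chosen jointly.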